New Arabic Medical Dataset for Diseases Classification
Abstract
The Arabic language suffers from a great shortage of datasets suitable for training deep learning models, and the existing ones cover only general, non-specialized classifications. In this work, we introduce a new Arabic medical dataset, which includes two thousand documents collected from several websites, in addition to the Medical Encyclopedia. The dataset was built for the task of classifying texts into 10 classes of diseases (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver, and Nephrological). Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google; AraBERT, which is based on BERT and trained on a large corpus; and AraBioNER, likewise trained on a corpus.
Similar resources
A Dataset for Arabic Textual Entailment
There are fewer resources for textual entailment (TE) for Arabic than for other languages, and the manpower for constructing such a resource is hard to come by. We describe here a semi-automatic technique for creating a first dataset for TE systems for Arabic using an extension of the ‘headline-lead paragraph’ technique. We also sketch the difficulties inherent in volunteer annotators-based jud...
ASTD: Arabic Sentiment Tweets Dataset
This paper introduces ASTD, an Arabic social sentiment analysis dataset gathered from Twitter. It consists of about 10,000 tweets which are classified as objective, subjective positive, subjective negative, and subjective mixed. We present the properties and the statistics of the dataset, and run experiments using standard partitioning of the dataset. Our experiments provide benchmark results f...
The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content
The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal...
The Enron Corpus: A New Dataset for Email Classification Research
Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze...
Improved Classification of Medical Universities in Iran, a New Approach
Background: In order to check the practicality of classification of Universities of Medical Sciences (UMSs) based on their infrastructures, and scientific contributions, this study explored the most appropriate indicators to measure the size and productivity of UMSs. Methods: In the first phase, we approached a group of experts who had a deep experience in the management of UMSs and in the mini...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-030-91608-4_20